In recent years, neural networks have grown remarkably in both their variety and their range of application. This growth has been driven by increases in both the available training data and the available computing power.
Neural networks can interpolate high-dimensional data in a flexible, adaptable way. Although each 'neuron' typically uses a simple (non-linear) function to map input to output, the multiple neurons in a layer, and then the multiple layers in a network, give neural networks enormous scope.
A multi-layered neural network consists of an input layer $\mathbf{x}$, an output layer $\mathbf{y}$, and one or more hidden layers $\mathbf{z_i}$. The neurons in each layer can be connected to some or all of the outputs of the previous layer. Each neuron applies an activation function to a weighted sum of its inputs. As an example, the second hidden layer $\mathbf{z_2}$ is related to the first hidden layer $\mathbf{z_1}$,
\begin{equation} \mathbf{z_2} = f (\mathbf{A_2}, \mathbf{z_1})\quad\text{,} \end{equation}where $f$ is the activation function and $\mathbf{A_2}$ is a matrix containing the weights and biases for this layer. For a particular neuron, the output will be given by,
\begin{equation} z_{2j} = f\Biggl( \sum_i w_{ji} z_{1i} + b_j \Biggr)\quad\text{,} \end{equation}where the summation is over the $i$ input values of $\mathbf{z_1}$, $w_{ji}$ is the weight that neuron $j$ applies to the $i$th input, and $b_j$ is its bias. In training the neural network, we seek optimum values for all the weights and biases (the parameters) of the network.
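As a concrete sketch of this layer computation (using NumPy, with an arbitrary choice of random weights, zero biases and a tanh activation, all purely illustrative), each neuron in the layer takes a weighted sum of the previous layer's outputs plus its bias and passes the result through $f$:

import numpy as np

rng = np.random.default_rng(0)
z1 = rng.normal(size=3)        # outputs of the previous layer (3 neurons, illustrative)
W = rng.normal(size=(4, 3))    # weights: 4 neurons in this layer, each with 3 inputs
b = np.zeros(4)                # one bias per neuron in this layer

z2 = np.tanh(W @ z1 + b)       # z2[j] = f( sum_i W[j, i] * z1[i] + b[j] ), here with f = tanh
print(z2)                      # four outputs, one per neuron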
The activation function ($f$ in the above equation) is usually a simple, differentiable, non-linear function. Common choices are the tanh function and the so-called rectified linear unit (ReLU), which is linear for positive inputs and zero otherwise.
We can use Python to plot some common activation functions:
import torch # we will use pytorch to build our neural network
from sklearn.metrics import r2_score
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-5,5,100)
y1 = x # linear
y2 = 1./(1+np.exp(-x)) # logistic
y3 = np.tanh(x) # tanh
y4 = np.clip(x,0,None) # rectified linear unit (ReLU)
plt.figure(figsize=(20,3))
plt.subplot(1,4,1)
plt.plot(x,y1)
plt.title('Linear')
plt.subplot(1,4,2)
plt.plot(x,y2)
plt.title('Logistic')
plt.subplot(1,4,3)
plt.plot(x,y3)
plt.title('Tanh')
plt.subplot(1,4,4)
plt.plot(x,y4)
plt.title('ReLU');
In training the network, we seek to minimise the error $E$ by changing the parameters in our matrices $\mathbf{A_j}$. If we define the error as a sum of squares, we could write this as
\begin{equation} \underset{\mathbf{A_j}}{\text{argmin}}\: E(\mathbf{A_1}, \mathbf{A_2},\dots, \mathbf{A_M}) = \underset{\mathbf{A_j}}{\text{argmin}}\: \sum_{k=1}^n \Bigl\| f(\mathbf{x_k}, \mathbf{A_1}, \mathbf{A_2},\dots, \mathbf{A_M} ) - \mathbf{y_k} \Bigr\|^2\quad\text{,} \end{equation}where $f$ is the total input-to-output function created by the network, $\mathbf{y_k}$ is the truth corresponding to input $\mathbf{x_k}$, and the sum runs over the $n$ training points.
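As a minimal illustration of evaluating this objective (not of the training procedure itself), the sketch below computes the sum-of-squares error for a toy one-layer 'network'; the random data, the weights W and the tanh activation are all assumptions made purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # n = 100 training inputs x_k, each with 2 components
Y = rng.normal(size=(100, 1))          # the corresponding truths y_k

W = rng.normal(size=(2, 1))            # parameters of a toy single-layer network

def predict(x):                        # stands in for the full input-to-output function f
    return np.tanh(x @ W)

E = np.sum((predict(X) - Y) ** 2)      # the sum-of-squares error over all n points
print(E)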
To see how this optimisation can be done, we now consider a simple one node, one hidden layer network: a single input $x$ is multiplied by a weight $b$ and passed through an activation function $g$, giving the hidden value $h = g(bx)$; this is multiplied by a second weight $a$ to give $z = ah$, and the output is $y = f(z)$, where $f$ is the output activation function.
When the error $E$ is minimised with respect to our network parameters ($a$ and $b$ in this case), we will have,
\begin{equation} \frac{\partial E}{\partial a} = \frac{\partial E}{\partial b} = 0\quad\text{.} \end{equation}If we write,
\begin{equation} E = \frac{1}{2}(y_0-y)^2\quad\text{,} \end{equation}where $y$ is the output of the network and $y_0$ is the truth, then, by the chain rule,
\begin{equation} \frac{\partial E}{\partial a} = -(y_0 - y)\frac{dy}{dz}\frac{dz}{da}\quad\text{.} \end{equation}The derivatives $dy/dz$ and $dz/da$ can be computed from the activation functions $f$ and $g$. Using the chain rule, we can see that the derivatives we require move upstream in the network, from output to input. This is called back propagation and is a key concept in neural networks.
Once we have the gradient of the error with respect to a particular weight, we may update that weight using,
\begin{equation} a_{\text{new}} = a_{\text{old}} - \delta\frac{\partial E}{\partial a}\quad\text{,} \end{equation}where $\delta$ is the learning rate.
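To make the whole procedure concrete, the sketch below trains this kind of one node, one hidden layer network by hand with NumPy. We assume (purely for illustration) that the hidden node computes $h = g(bx)$ and the output is $y = f(z)$ with $z = ah$, that both activations are tanh, and that there is a single training pair $(x, y_0)$; the initial weights, learning rate and number of steps are likewise illustrative choices:

import numpy as np

g = np.tanh                        # hidden-layer activation (assumed)
f = np.tanh                        # output activation (assumed)

x, y0 = 0.5, 0.8                   # a single (input, truth) training pair (illustrative)
a, b = 0.1, 0.1                    # initial weights
delta = 0.5                        # learning rate

for step in range(200):
    # forward pass
    h = g(b * x)                   # hidden node output
    z = a * h                      # weighted input to the output node
    y = f(z)                       # network output
    E = 0.5 * (y0 - y) ** 2        # the error

    # backward pass (the chain rule, moving from output back towards input)
    dE_da = -(y0 - y) * (1 - y ** 2) * h                        # dE/da = -(y0 - y) * dy/dz * dz/da, with dy/dz = 1 - tanh(z)^2 and dz/da = h
    dE_db = -(y0 - y) * (1 - y ** 2) * a * (1 - h ** 2) * x     # continue the chain one step further back to reach b

    # gradient descent updates
    a = a - delta * dE_da
    b = b - delta * dE_db

print(E, y, y0)                    # after training, y should be close to y0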
As the number of trainable parameters in our matrices $\mathbf{A_j}$ and the number of data points $n$ increase, evaluating the error and its gradients over the full data set for every update becomes impractical. Stochastic gradient descent instead uses a random subset, or batch, of the $n$ available points. Having selected a batch, the errors and gradients (by back propagation) are computed from the results of applying the network to that batch only. The network parameters are then updated, a new batch is selected, and the process repeats.
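A minimal NumPy sketch of this batching idea is given below (the PyTorch training loop later in this section follows the same pattern); the synthetic data, the single-parameter model y = tanh(a*x), the batch size and the learning rate are all illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)                 # n = 200 training inputs (synthetic)
y0 = np.tanh(1.5 * x)                            # corresponding truths, generated with a = 1.5
a = 0.0                                          # the single trainable parameter of the model y = tanh(a * x)
delta, bat_size = 0.1, 20                        # learning rate and batch size

for step in range(500):
    batch = rng.choice(len(x), bat_size, replace=False)    # select a random batch of row indices
    xb, yb = x[batch], y0[batch]
    y = np.tanh(a * xb)                                     # apply the model to this batch only
    dE_da = np.sum(-(yb - y) * (1 - y ** 2) * xb)           # gradient of the batch error with respect to a
    a = a - delta * dE_da                                   # update the parameter, then repeat with a new batch

print(a)                                         # a should be close to the true value of 1.5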
PyTorch is a widely used framework for developing machine learning models in Python. We will use it to build a multi-layer neural network (two hidden layers) that maps six input variables to one output variable.
We first create a class called Net which initialises the arrays (in the __init__ function) and defines the structure of the neural network (the forward function).
class Net(torch.nn.Module):                             # Define "Net" as a subclass of torch.nn.Module
    def __init__(self):                                 # When we create an instance of Net, the __init__ function is called
        super(Net, self).__init__()                     # This line means that when we create a Net and run __init__, we also run the __init__ of the parent torch.nn.Module class
        self.hid1 = torch.nn.Linear(6, 10)              # Set the size of hidden layer 1: 6 inputs x 10 outputs
        self.hid2 = torch.nn.Linear(10, 10)             # Set the size of hidden layer 2: 10 inputs x 10 outputs
        self.oupt = torch.nn.Linear(10, 1)              # The output layer is of size: 10 inputs x 1 output
        torch.nn.init.xavier_uniform_(self.hid1.weight) # initialise the weights (Xavier uniform distribution)
        torch.nn.init.zeros_(self.hid1.bias)            # initialise the biases (zero)
        torch.nn.init.xavier_uniform_(self.hid2.weight)
        torch.nn.init.zeros_(self.hid2.bias)
        torch.nn.init.xavier_uniform_(self.oupt.weight)
        torch.nn.init.zeros_(self.oupt.bias)

    def forward(self, x):                               # Now define the NN:
        z1 = torch.tanh(self.hid1(x))                   # z1 is output from the tanh activation function applied to the result of hid1 on x
        z2 = torch.tanh(self.hid2(z1))                  # z2 is the tanh function applied to the result of hid2 on z1
        y = self.oupt(z2)                               # y is the result of oupt on z2 (no activation on the output layer)
        return y
normalise is a helper function to make the columns of an array go from 0 to 1.
def normalise(dataarray):
    dataarraynormalise = np.zeros_like(dataarray)       # array to hold the normalised columns
    for j, column in enumerate(dataarray.T):            # loop over the columns
        minvalue = np.min(column)
        maxvalue = np.max(column)
        print('min', minvalue, 'max', maxvalue)
        dataarraynormalise[:, j] = (column - minvalue) / (maxvalue - minvalue)   # rescale this column to the range 0 to 1
    return dataarraynormalise
loaddata is a function that loads a csv file and splits it into inputs and outputs, and into training and test sets.
def loaddata(filename):
    data = np.loadtxt(filename, delimiter=",", skiprows=1)   # load data (skip the first line as this contains the headings)
    columns_train_x = [0, 1, 2, 3, 4, 5]                     # these columns are inputs
    columns_train_y = [6]                                    # this column is the output
    np.random.shuffle(data)                                  # randomise the order of the rows
    ndata = len(data)                                        # total number of rows
    nsplit = int(ndata * 0.8)                                # index at 80 percent of the rows
    train_x = np.array(data[:nsplit, columns_train_x])       # train on the first 80 percent of rows
    train_y = np.array(data[:nsplit, columns_train_y])
    test_x = np.array(data[nsplit:, columns_train_x])        # test on the last 20 percent of rows
    test_y = np.array(data[nsplit:, columns_train_y])
    return train_x, train_y, test_x, test_y
torch.manual_seed(1); np.random.seed(1) # to make the results repeatable, we fix the seeds of the random number generators in PyTorch and NumPy before we train the NN
train_x,train_y,test_x,test_y = loaddata('data/exampledatasetNN.csv') # load the data file
train_x = normalise(train_x) # normalise the input data
test_x = normalise(test_x)
We now create our network net and set the error function and optimiser method.
net = Net() # net is an instance of our new nn Class, "Net"
net = net.train() # set the mode of net to be "train"
bat_size = 50 # this is our batch size. We will load 50 rows of training data at a time.
loss_func = torch.nn.MSELoss() # mean squared error
optimizer = torch.optim.Adam(net.parameters(), lr=0.01) # use the built-in Adam optimiser. This is related to stochastic gradient descent, but adapts the learning rate for each trainable parameter
Train the network:
n_items = len(train_x) # total number of rows in training set
batches_per_epoch = n_items // bat_size # number of batches in training set ( // rounds down to nearest integer)
max_batches = 1000 * batches_per_epoch
print("Starting training")
for b in range(max_batches):
    curr_bat = np.random.choice(n_items, bat_size, replace=False)   # the current batch is a random sample of row indices
    X = torch.Tensor(train_x[curr_bat])                             # the current inputs X
    Y = torch.Tensor(train_y[curr_bat]).view(bat_size, 1)           # the current true outputs (reshaped) Y
    optimizer.zero_grad()                                           # set the optimiser gradients to zero
    oupt = net(X)                                                   # run our net on X and obtain oupt
    loss_obj = loss_func(oupt, Y)                                   # compute the loss objective based on the current oupt and the truth (Y)
    loss_obj.backward()                                             # compute the gradients of the loss objective with respect to all the trainable parameters in our NN using back propagation
    optimizer.step()                                                # update the trainable parameters
    if b % batches_per_epoch == 0:
        print(b, loss_obj.item())                                   # print the loss once per epoch
print("Training complete \n")
We assess our model by plotting the actual $y$ against the predicted $y$ from the network. We also compute the coefficient of determination ($R^2$ score) for both the test and training sets.
X = torch.Tensor(test_x)
y = net(X)
y = y.detach().numpy() # convert output from torch tensor to numpy array for plotting
plt.scatter(test_y, y, color='red',s=4,label='Test set')
print("Test set R^2: " + str(r2_score(test_y, y)))
X = torch.Tensor(train_x)
y = net(X)
y = y.detach().numpy()
plt.scatter(train_y, y, color='black',s=2,alpha=0.2,label='Training set')
print("Training set R^2: " + str(r2_score(train_y, y)))
plt.xlabel("actual")
plt.ylabel("predicted")
plt.legend();
We mention here two challenges associated with neural networks.
The first is interpretability; although we can quantify how well a network fits the data, it is not easy to assess how or why the fit has been achieved.
The second is over-fitting; we must always check that the network we have built is reliable and has not been over-fitted to our particular training sample. It is good practice to reserve a portion of the data (20 percent is typical) for testing only (i.e. this data is not used in training). In $k$-fold cross-validation, the training data is further split into $k$ subsets; the model is trained $k$ times, each time holding out a different subset for validation, and the validation scores are averaged to give a more robust estimate of performance (for example, when choosing hyperparameters). The final model is then tested on the withheld test portion of the data.
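As an illustrative sketch of $k$-fold cross-validation (using scikit-learn, which we already used above for r2_score), the example below trains and scores a simple model on each fold and averages the validation scores; the synthetic data and the choice of linear regression as the model are assumptions made purely for illustration:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                              # synthetic inputs: 200 samples, 6 features
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=200)    # synthetic output with a little noise

kf = KFold(n_splits=5, shuffle=True, random_state=0)       # k = 5 folds
scores = []
for train_idx, val_idx in kf.split(X):                     # each fold: train on k-1 parts, validate on the remaining part
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))

print(np.mean(scores))                                     # average validation R^2 across the k folds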